score difference
- Europe > Austria > Vienna (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (6 more...)
- North America > Canada > Alberta (0.14)
- Asia > Singapore (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (11 more...)
- Research Report > Experimental Study (0.92)
- Research Report > New Finding (0.67)
- Education > Educational Technology > Educational Software (0.46)
- Education > Curriculum > Subject-Specific Education (0.45)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.92)
- Information Technology > Artificial Intelligence > Cognitive Science (0.92)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Asia > Middle East > Jordan (0.04)
- Leisure & Entertainment > Games (1.00)
- Leisure & Entertainment > Sports > Hockey (0.48)
Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB
Ren, Xingyu, Sun, Youran, Liang, Haoyu
We find that current text embedding models produce outputs with a consistent bias, i.e., each embedding vector $e$ can be decomposed as $\tilde{e} + μ$, where $μ$ is almost identical across all sentences. We propose a plug-and-play, training-free and lightweight solution called Renormalization. Through extensive experiments, we show that renormalization consistently and statistically significantly improves the performance of existing models on the Massive Multilingual Text Embedding Benchmark (MMTEB). In particular, across 38 models, renormalization improves performance by 9.7 $σ$ on retrieval tasks, 3.1 $σ$ on classification tasks, and 0.8 $σ$ on other types of tasks. Renormalization has two variants: directly subtracting $μ$ from $e$, or subtracting the projection of $e$ onto $μ$. We theoretically predict that the latter performs better, and our experiments confirm this prediction.
- Europe > Austria > Vienna (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (6 more...)
- North America > Canada > Alberta (0.14)
- Asia > Singapore (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (11 more...)
- Research Report > Experimental Study (0.92)
- Research Report > New Finding (0.67)
- Education > Educational Technology > Educational Software (0.46)
- Education > Curriculum > Subject-Specific Education (0.45)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.92)
- Information Technology > Artificial Intelligence > Cognitive Science (0.92)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
Comparison of Scoring Rationales Between Large Language Models and Human Raters
Hua, Haowei, Jiao, Hong, Song, Dan
Advances in automated scoring are closely aligned with advances in machine-learning and natural-language-processing techniques. With recent progress in large language models (LLMs), the use of ChatGPT, Gemini, Claude, and other generative-AI chatbots for automated scoring has been explored. Given their strong reasoning capabilities, LLMs can also produce rationales to support the scores they assign. Thus, evaluating the rationales provided by both human and LLM raters can help improve the understanding of the reasoning that each type of rater applies when assigning a score. This study investigates the rationales of human and LLM raters to identify potential causes of scoring inconsistency. Using essays from a large-scale test, the scoring accuracy of GPT-4o, Gemini, and other LLMs is examined based on quadratic weighted kappa and normalized mutual information. Cosine similarity is used to evaluate the similarity of the rationales provided. In addition, clustering patterns in rationales are explored using principal component analysis based on the embeddings of the rationales. The findings of this study provide insights into the accuracy and ``thinking'' of LLMs in automated scoring, helping to improve the understanding of the rationales behind both human scoring and LLM-based automated scoring.
- North America > United States > Iowa (0.04)
- North America > United States > Maryland (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.88)
- North America > Canada > Ontario > Toronto (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
- North America > Canada (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Robots (0.69)
On Monotonicity in AI Alignment
Bareilles, Gilles, Fageot, Julien, Hoang, Lê-Nguyên, Blanchard, Peva, Bouaziz, Wassim, Rouault, Sébastien, El-Mhamdi, El-Mahdi
Comparison-based preference learning has become central to the alignment of AI models with human preferences. However, these methods may behave counterintuitively. After empirically observing that, when accounting for a preference for response $y$ over $z$, the model may actually decrease the probability (and reward) of generating $y$ (an observation also made by others), this paper investigates the root causes of (non) monotonicity, for a general comparison-based preference learning framework that subsumes Direct Preference Optimization (DPO), Generalized Preference Optimization (GPO) and Generalized Bradley-Terry (GBT). Under mild assumptions, we prove that such methods still satisfy what we call local pairwise monotonicity. We also provide a bouquet of formalizations of monotonicity, and identify sufficient conditions for their guarantee, thereby providing a toolbox to evaluate how prone learning models are to monotonicity violations. These results clarify the limitations of current methods and provide guidance for developing more trustworthy preference learning algorithms.
- Europe > Austria > Vienna (0.14)
- North America > Canada > British Columbia > Vancouver (0.05)
- Oceania > Palau (0.04)
- (8 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.67)